Abstract

A central question in information theory is to determine the maximum success probability that 
can be achieved in sending a fixed number of messages over a noisy channel. This was first studied 
in the pioneering work of Shannon who established a simple expression characterizing this quantity 
in the limit of multiple independent uses of the channel. Here we consider the general setting with 
only one use of the channel. We observe that the maximum success probability can be expressed 
as the maximum value of a submodular function. Using this connection, we establish the following 
results: 

1. There is a simple greedy polynomial-time algorithm that computes a code achieving a
$(1 - e^{-1})$-approximation of the maximum success probability. The factor $(1 - e^{-1})$ can be improved
arbitrarily close to 1 at the cost of slightly reducing the number of messages to be sent. Moreover,
it is NP-hard to obtain an approximation ratio strictly better than $(1 - e^{-1})$ for the problem of
computing the maximum success probability.

2. Shared quantum entanglement between the sender and the receiver can increase the success
probability by a factor of at most $1/(1 - e^{-1})$. In addition, this factor is tight if one allows an arbitrary
non-signaling box between the sender and the receiver.

3. We give tight bounds on the one-shot performance of the meta-converse of Polyanskiy-Poor-Verdú.


1 Introduction 

One of the central threads of research in information theory is the study of the minimum error probability
that can be achieved in sending a fixed number of messages over a noisy channel. This task can
be formulated as the maximization, over valid encoders and decoders, of the probability of correctly
determining the sent message (which we refer to as success probability for the rest of the paper). In his
seminal work, Shannon [22] showed that for $n$ independent copies of the channel, this question is almost
completely answered by a single number, $C$, the capacity of the channel. In fact, for a number of messages
$k(n)$ satisfying $\sup_n \frac{\log k(n)}{n} < C$, the maximum success probability tends to 1 as $n$ tends to infinity,
and when $\inf_n \frac{\log k(n)}{n} > C$, the maximum success probability tends to 0 as $n$ tends to infinity [28].

Here, we study the algorithmic aspects of determining the optimal encoder and decoder which lead
to the maximum success probability over a noisy channel in the non-asymptotic regime. Recently, in the
information theory literature there has been significant interest in understanding the non-asymptotic
behavior when the number of channel uses $n$ is finite [23, 10, 18, 25]. But, instead of focusing on closed-form
expressions for the maximum rate at which information can be transmitted, we rather ask how well
the optimal rates of transmission can be computed with an efficient algorithm. One way to formulate
the computational problem is as follows. The input of the problem is an integer $k$ (which denotes the
total number of messages) together with the description of a channel $W$ that maps elements in $X$ to
elements in $Y$; specifically, for each $y \in Y$ and $x \in X$, we have $W(y|x)$, which is the probability of
receiving the symbol $y$ when the symbol $x$ is transmitted. The objective is to determine the maximum
success probability,^1 $S(W, k)$, for transmitting $k$ distinct messages using the channel $W$ (only once).
This algorithmic formulation leads us to interesting implications that are described below. We refer to
this problem of determining the maximum success probability of a given noisy channel as the optimal
channel coding problem.

Our more specific findings can be summarized as follows: 

• We observe that the problem of computing the optimal success probability $S(W, k)$ corresponds to
a submodular maximization problem with cardinality constraints (Proposition 2.1), which implies
that this quantity can be efficiently approximated using a simple greedy algorithm that achieves a
$(1 - e^{-1})$-approximation ratio. As the maximum coverage problem can be reduced to the optimal
channel coding problem, we also find that it is NP-hard to approximate $S(W, k)$ within a factor
larger than $(1 - e^{-1})$.

• The natural linear programming relaxation of the optimal channel coding problem is well-studied
in the information theory literature under a different name. It corresponds to the well-known
meta-converse of Polyanskiy-Poor-Verdú (PPV) [18], which puts an upper bound on the maximum
success probability of sending $k$ messages.^2 Matthews [14] showed that this linear programming
relaxation corresponds to the maximum success probability, $S^{\mathrm{NS}}(W, k)$, when the sender and the
receiver are allowed to share any non-signaling box, i.e., a (hypothetical) device taking inputs from
both parties and generating outputs for both parties in a manner that makes it, by itself, useless
for communication. Our main finding is an upper bound on the integrality gap for the linear
programming relaxation of the optimal channel coding problem (Theorem 3.1). In particular, for
any channel $W$ and integers $k$ and $\ell$, we have:

$$\frac{k}{\ell}\left(1 - e^{-\ell/k}\right) S^{\mathrm{NS}}(W, k) \le S(W, \ell) \le S^{\mathrm{NS}}(W, \ell) . \tag{1}$$

When $\ell = k$, this inequality says that the ratio of the optimal success probability and the meta-converse
is at least $(1 - e^{-1})$. More generally, if a better guarantee on the success probability
is desired, this can be achieved at the expense of taking a number of messages $\ell$ that is slightly
smaller than $k$. For example, if we take $k = 2^t$ and $\ell = \frac{2^t}{t}$, we obtain after simplification

$$\left(1 - \frac{1}{2t}\right) S^{\mathrm{NS}}(W, 2^t) \le S\left(W, \frac{2^t}{t}\right) \le S^{\mathrm{NS}}\left(W, \frac{2^t}{t}\right) . \tag{2}$$

The bound (1) can be seen as a bicriterion upper bound on the integrality gap, highlighting the
tradeoff between the two important parameters: success probability and number of messages. We
note that it is important for our applications to analyze the linear programming relaxation and not
only understand the performance of the greedy algorithm for the optimal channel coding problem.
We give two algorithmic proofs of this result. The first one analyzes the greedy algorithm,
which can be done by combining a result of [6] together with an important observation that can
be found in [13, Theorem 1.5]. We provide a self-contained elementary proof of the result. In the
second proof, we analyze the coding strategy which is commonly used in achievability bounds: a
random code chosen according to a distribution given by the meta-converse. This second analysis
is done using standard randomized rounding techniques. Moreover, a family of examples
from [1] gives a family of channels for which the guarantees in (1) are tight.

1 In this paper, we focus on the success probability on average over the k messages. 

2 Technically, the PPV bound gives an upper bound on the number of messages for a desired success probability. It is
however simple to adapt it to maximizing success probability for a fixed number of messages. The reason for the usefulness of
the meta-converse is that it can be analytically evaluated for many settings of interest, in particular for $n$ independent channel
uses for non-asymptotic values of $n$; see [16, 24] and references therein for an overview of the active area of finite blocklength
analysis.

• As quantum entanglement cannot be used for signaling, the inequality (1) puts a limit on the
usefulness of entanglement between the sender and the receiver for the problem of coding over
a noisy (classical) channel. The fact that entanglement could improve success probabilities was
highlighted by [21, 7]. In this paper we in fact obtain an explicit upper bound on the improvement
that can be achieved via entanglement.

The bound (1) with $\ell = k$ addresses a question asked by Hemenway et al. [12], who proved that
for the special case of transmitting a single bit, i.e., $k = 2$, the ratio of the quantum advantage to the
classical advantage is at most 2.^3 For an explicit generalization of their result to arbitrary values of
$k$, see inequality (21), which is a consequence of our main theorem.

1.1 Outline 

In Section 2 we establish that the optimal channel coding problem corresponds to submodular maximization
with cardinality constraints. Though, in and of itself, this connection is direct, it provides a
useful starting point. In particular, we extend this connection in Section 3 to obtain interesting implications
for channel coding with non-signaling boxes and in particular quantum entanglement.

2 Optimal channel coding as a submodular maximization problem 

Given a noisy channel $W$ whose input and output alphabets are $X$ and $Y$ respectively, along with an
integer $k$, our aim is to build an encoder and a decoder that can transmit $k$ distinct messages with
the smallest error probability averaged over the messages. We can describe a (possibly randomized)
encoder $e$ taking as input message $i \in [k]$ and mapping it to $x \in X$ with probability $e(x|i)$. Similarly, a
decoder $d$ takes as input some $y \in Y$ and outputs $i \in [k]$ with probability $d(i|y)$. The maximum success
probability $S(W, k)$ of sending $k$ messages using the noisy channel $W$ can be written as the following
optimization program over encoding and decoding maps.

$$
\begin{aligned}
S(W, k) = \underset{e, d}{\text{maximize}} \quad & \frac{1}{k} \sum_{x, y, i} e(x|i) W(y|x) d(i|y) \\
\text{subject to} \quad & \sum_{x} e(x|i) = 1 && \forall i \in [k] \\
& \sum_{i} d(i|y) = 1 && \forall y \in Y \\
& 0 \le e(x|i) \le 1 && \forall (x, i) \in X \times [k] \\
& 0 \le d(i|y) \le 1 && \forall (i, y) \in [k] \times Y .
\end{aligned}
\tag{3}
$$

The next proposition is a simple but important observation that the problem described in (3) is of a
very well-studied type: maximizing a submodular function subject to a cardinality constraint.

Proposition 2.1. Let $W$ be a channel, with input alphabet $X$ and output alphabet $Y$, and $k \ge 1$ be an integer.
Then we have

$$S(W, k) = \frac{1}{k} \max_{S \subseteq X, |S| \le k} f_W(S) \quad \text{where } f_W : 2^X \to \mathbb{R}_+ \text{ is defined by } f_W(S) = \sum_{y \in Y} \max_{x \in S} W(y|x) . \tag{4}$$

Moreover, $f_W$ is monotone and submodular, i.e., for any $S \subseteq T \subseteq X$ and $x \notin T$,

$$\text{Monotone:} \quad f_W(T \cup \{x\}) \ge f_W(T) \tag{5}$$

$$\text{Submodular:} \quad f_W(S \cup \{x\}) - f_W(S) \ge f_W(T \cup \{x\}) - f_W(T) . \tag{6}$$

3 Note that in the case $k = 2$, the quantity $S(W, 2)$ can be written in a very explicit form as a function of the maximum total
variation distance between all pairs of distributions $W(\cdot|x_1)$ and $W(\cdot|x_2)$; see [12, Proposition 1] for more details.
Proof. The monotonicity of $f_W$ is clear. For the submodularity, let $S \subseteq T \subseteq X$. For any $u \notin T$,

$$
\begin{aligned}
& \left(f_W(S \cup \{u\}) - f_W(S)\right) - \left(f_W(T \cup \{u\}) - f_W(T)\right) && (7) \\
&= \sum_{y} \left( \max_{x \in S \cup \{u\}} W(y|x) - \max_{x \in S} W(y|x) - \Big( \max_{x \in T \cup \{u\}} W(y|x) - \max_{x \in T} W(y|x) \Big) \right) . && (8)
\end{aligned}
$$

For each $y \in Y$, we distinguish two cases. In the first case, $\max_{x \in T \cup \{u\}} W(y|x) = W(y|u)$, and the expression
reduces to $\max_{x \in T} W(y|x) - \max_{x \in S} W(y|x) \ge 0$. The second case is that the maximum is achieved in $T$,
i.e., $\max_{x \in T \cup \{u\}} W(y|x) = W(y|x)$ for some $x \in T$. In this case $\max_{x \in T \cup \{u\}} W(y|x) - \max_{x \in T} W(y|x) = 0$
and the above expression is non-negative because the maximum over $S \cup \{u\}$ is at least the maximum over $S$.

We now show that $S(W, k) \ge \frac{1}{k} \max_{S \subseteq X, |S| \le k} f_W(S)$. For this, choose $S \subseteq X$ of size $\ell \le k$ (as $f_W$
is monotone, we can assume that the maximum is attained for some $S$ of size exactly $\min(k, |X|)$). Then
arbitrarily order the elements in $S = \{x_1, \dots, x_\ell\}$. Define $e(x_i|i) = 1$ for all $i \in [\ell]$ and set all the other
variables $e(\cdot|i)$ for $i \in [\ell]$ to zero. For $i \in \{\ell + 1, \dots, k\}$, set $e(x|i)$ in an arbitrary way that satisfies the
normalization constraint. Then, for any $y$ define $m(y)$ to be $i \in [\ell]$ such that $W(y|x_i)$ is the maximum (in
case of multiple $i$'s with the same maximum value, choose the smallest $i$). Then set $d(m(y)|y) = 1$ for all
$y \in Y$ and zero for all other entries in $d$. Clearly $e$ and $d$ satisfy the constraints. Moreover,

$$
\begin{aligned}
\frac{1}{k} \sum_{i \in [k], x, y} W(y|x) e(x|i) d(i|y) &\ge \frac{1}{k} \sum_{i \in [\ell]} \sum_{y} W(y|x_i) d(i|y) && (9) \\
&= \frac{1}{k} \sum_{y} W(y|x_{m(y)}) && (10) \\
&= \frac{1}{k} \sum_{y} \max_{x \in S} W(y|x) , && (11)
\end{aligned}
$$

which leads to the desired result.

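For the other direction, if we define $x_i$ to be the symbol that maximizes $\sum_y W(y|x) d(i|y)$, we have

$$
\begin{aligned}
\sum_{i, x} e(x|i) \sum_{y} W(y|x) d(i|y) &\le \sum_{i} \max_{x} \sum_{y} W(y|x) d(i|y) && (12) \\
&= \sum_{i} \sum_{y} W(y|x_i) d(i|y) && (13) \\
&\le \sum_{y} \max_{i \in [k]} W(y|x_i) && (14) \\
&\le \max_{S \subseteq X, |S| \le k} \sum_{y} \max_{x \in S} W(y|x) . && (15)
\end{aligned}
$$

□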

It follows from the proof above that any algorithm computing an optimal $S$ in (4) can be easily
transformed into an algorithm computing optimal encoding and decoding functions $e, d$ in (3), and vice
versa. For this reason, we interchangeably talk about the code $S$ and the encoding-decoding pair $(e, d)$.
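The correspondence between a code $S$ and a pair $(e, d)$ can be illustrated with a minimal Python sketch. The function names (`f_W`, `optimal_code`, `success_of_code`) and the representation of the channel as a nested list with `W[x][y]` standing for $W(y|x)$ are assumptions made for this illustration, not notation from the paper. The exhaustive search is exponential in $|X|$ and is only meant for tiny examples.

```python
from itertools import combinations

def f_W(W, S):
    """f_W(S) = sum over outputs y of max_{x in S} W[x][y], with W[x][y] = W(y|x)."""
    if not S:
        return 0.0
    return sum(max(W[x][y] for x in S) for y in range(len(W[0])))

def optimal_code(W, k):
    """Exhaustive search over codes of size min(k, |X|); exponential time."""
    size = min(k, len(W))
    best = max(combinations(range(len(W)), size), key=lambda S: f_W(W, S))
    return list(best), f_W(W, best) / k

def success_of_code(W, S, k):
    """Success probability of the deterministic (e, d) pair built from S as in the proof."""
    # Encoder: message i < len(S) is sent as S[i]; any remaining message is sent as S[0].
    enc = [S[i] if i < len(S) else S[0] for i in range(k)]
    total = 0.0
    for y in range(len(W[0])):
        # Decoder: maximum likelihood among the codewords, smallest index on ties.
        m = max(range(len(S)), key=lambda i: (W[S[i]][y], -i))
        total += W[enc[m]][y]
    return total / k
```

On a small hypothetical channel such as `W = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.4, 0.3, 0.3]]` with `k = 2`, the value returned by `optimal_code` and the success probability computed by `success_of_code` for the same code agree, as the proof of Proposition 2.1 predicts.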

We note that the expression (4) is well-known in information theory and comes from taking a maximum
likelihood decoder, see e.g., [26, Section III]. It is also worth observing that $\log f_W(S) = I_\infty(U_S; Y)$
where the joint distribution of $(U_S, Y)$ is defined by $P_{U_S Y}(x, y) = \frac{1}{|S|} W(y|x)$ if $x \in S$ and zero otherwise,
and $I_\infty$ is the mutual information of order $\infty$ (see [27] for a discussion of $\alpha$-mutual information).

Using the notable result of [15], the formulation in (4) immediately shows that the quantity $S(W, k)$
can be approximated efficiently within a factor of $(1 - e^{-1})$. In fact, this can be achieved with a very
simple greedy algorithm. Starting with the set $S_0 = \emptyset$, the subset $S_{\ell+1} \subseteq X$ is constructed from the subset $S_\ell \subseteq X$
by adding an element $x_{\ell+1}$ that maximizes $f_W(S_\ell \cup \{x_{\ell+1}\})$, so that $S_{\ell+1} = S_\ell \cup \{x_{\ell+1}\}$. Let $S^{\mathrm{greedy}}(W, k)$
be the success probability obtained by the greedy algorithm on input $W$ and $k$. With the notation used
above and the function $f_W$ defined in (4), we have $S^{\mathrm{greedy}}(W, k) = \frac{1}{k} f_W(S_k)$; here $S_k$ is the size-$k$ subset
computed by the greedy algorithm.
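The greedy procedure just described can be sketched in a few lines of Python. The names `f_W` and `greedy_code` and the nested-list channel representation (`W[x][y]` for $W(y|x)$) are assumptions of this sketch. Ties are broken in favor of the smallest input index, which is one arbitrary choice among many.

```python
def f_W(W, S):
    """f_W(S) = sum over outputs y of max_{x in S} W[x][y]; zero for the empty set."""
    return sum(max((W[x][y] for x in S), default=0.0) for y in range(len(W[0])))

def greedy_code(W, k):
    """Repeatedly add the input symbol x that maximizes f_W(S + [x])."""
    S = []
    for _ in range(min(k, len(W))):
        candidates = [x for x in range(len(W)) if x not in S]
        # max() returns the first maximizer, i.e. the smallest index on ties.
        S.append(max(candidates, key=lambda x: f_W(W, S + [x])))
    return S, f_W(W, S) / k
```

For the hypothetical channel `W = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]` and `k = 2`, this tie-breaking rule selects inputs 0 and 1 and reaches success probability 0.75, whereas the code made of inputs 1 and 2 achieves 1. The greedy value is therefore not optimal here, but it stays above the $(1 - e^{-1})$ guarantee.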

Corollary 2.2. For any channel $W$ and any $k$,

$$(1 - e^{-1}) S(W, k) \le S^{\mathrm{greedy}}(W, k) \le S(W, k) . \tag{16}$$

Moreover, if there is a polynomial-time algorithm that takes as input $W$ and $k$ and returns a number $\mathrm{Alg}(W, k)$
satisfying for some $\varepsilon > 0$ and all inputs $W, k$ the inequality $(1 - e^{-1} + \varepsilon) S(W, k) \le \mathrm{Alg}(W, k) \le S(W, k)$, then
P = NP.

Proof. The fact that the greedy algorithm provides a $(1 - e^{-1})$-approximation algorithm for $S$ follows
directly from [15]. In Section 3, we provide a proof of a strengthening of this result.

For the hardness of the problem, we use the hardness of the maximum $k$-coverage problem [8]. In the
maximum $k$-coverage problem, we are given a collection of sets $\{T_x\}_{x \in X}$ of elements in $Y$ (i.e., $T_x \subseteq Y$
for each $x \in X$) and the objective is to find a subset $S \subseteq X$ of size $k$ such that $|\bigcup_{x \in S} T_x|$ is as large as
possible. Feige [8] showed that this problem is hard to approximate within a factor better than $(1 - e^{-1})$.
In fact, as highlighted in [?], the problem is still hard if the sets $T_x$ all have the same size, call it $d$. Given
such an instance we define the following channel: $W(y|x) = \frac{1}{d}$ if $y \in T_x$ and 0 otherwise. Then for any
choice $S \subseteq X$, we have $|\bigcup_{x \in S} T_x| = \sum_{y} \max_{x \in S} W(y|x) \cdot d = d \cdot f_W(S)$. This shows the desired result. □
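The reduction from maximum coverage can be checked numerically on a toy instance. In the sketch below, `coverage_channel` is a hypothetical helper name, sets are given as Python sets over the universe `0..universe_size-1`, and `W[x][y]` stands for $W(y|x)$; none of these names come from the paper.

```python
def coverage_channel(sets, universe_size):
    """Channel with W[x][y] = 1/d if y is in T_x and 0 otherwise, for sets of common size d."""
    d = len(sets[0])
    if any(len(T) != d for T in sets):
        raise ValueError("the reduction assumes that all sets have the same size")
    W = [[(1.0 / d if y in T else 0.0) for y in range(universe_size)] for T in sets]
    return W, d

def f_W(W, S):
    """f_W(S) = sum over outputs y of max_{x in S} W[x][y]; zero for the empty set."""
    return sum(max((W[x][y] for x in S), default=0.0) for y in range(len(W[0])))

def covered(sets, S):
    """Number of universe elements covered by the sets indexed by S."""
    return len(set().union(*(sets[x] for x in S)))
```

With `sets = [{0, 1}, {1, 2}, {3, 4}]`, the identity $|\bigcup_{x \in S} T_x| = d \cdot f_W(S)$ can be verified for every choice of `S`, for instance `S = [0, 1]` (three covered elements) and `S = [0, 2]` (four covered elements).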


3 Channel coding with non-signaling boxes 

A key motivation for this work is to understand the advantage that can be achieved when the sender
and the receiver share additional resources that are by themselves useless for communication. Such
resources are commonly called non-signaling boxes [19]. The simplest example of a non-signaling box is
a device providing shared randomness between the sender and the receiver. It is quite simple to see that
allowing the encoder $e$ and decoder $d$ to depend on some shared randomness will not affect the value
of (3). However, if the sender and the receiver share entanglement, we know that for some channels,
a success probability $S^{\mathrm{Q}}(W, k)$ that exceeds $S(W, k)$ can be achieved [21, 7]. This is analogous to a Bell
inequality violation [5], or in other words the fact that the entangled value of a 2-prover 1-round game
can be larger than the classical value.^4

A natural question to ask here is how much entanglement (or, in general, a non-signaling resource)
between the sender and receiver can help for reliable transmission. For example, is there a choice
of channel for which the success probability with entanglement is close to 1 and without entanglement
is very small? Our main result shows that this cannot be the case and that the ratio $\frac{S(W, k)}{S^{\mathrm{Q}}(W, k)} \ge 1 - e^{-1}$.

As $S^{\mathrm{Q}}(W, k)$ does not seem to be easy to analyze,^5 it is helpful to consider even more general resources
between the sender and the receiver. Allowing for any non-signaling box between the sender and the
receiver leads to a very natural linear programming (LP) relaxation of (3).

4 In fact, the problem of optimal channel coding that we are studying can be seen as a kind of game. The input of the sender
is the label $i$ of the message and his output is an element $x \in X$. The input of the receiver is some $y \in Y$ and his output is a label
$j$ of some message. The difference is that the way one would normalize for a game is different than the way we do it here.
Also, in our setting, the referee's output is not necessarily 0 or 1 but rather a utility for each input-output pair that depends on
the channel probabilities $W(y|x)$.

5 We only know of a hierarchy of semidefinite programs that converges to this value [4].


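$$
\begin{aligned}
S^{\mathrm{NS}}(W, k) = \underset{r_{x,y},\, p_x}{\text{maximize}} \quad & \frac{1}{k} \sum_{x, y} W(y|x) r_{x,y} \\
\text{subject to} \quad & \sum_{x} r_{x,y} \le 1 && \forall y \in Y \\
& \sum_{x} p_x = k \\
& r_{x,y} \le p_x && \forall (x, y) \in X \times Y \\
& 0 \le r_{x,y},\, p_x \le 1 && \forall (x, y) \in X \times Y .
\end{aligned}
\tag{17}
$$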

Matthews [14] showed that $S^{\mathrm{NS}}(W, k)$ corresponds to the maximum success probability that can be
achieved if the sender and the receiver are allowed to use any additional non-signaling resource (see
also [?, Section II.B] for another explanation of this fact). This means that they can use any additional
box as long as it does not allow the sender and the receiver to send information to each other; see [19]
and references therein for a review on the study of non-signaling boxes. For the convenience of the
reader, we repeat the proof of [14] in Appendix A.^6


As pointed out in [14], a form of the LP (17) is widely known in the information theory literature as
the Polyanskiy-Poor-Verdú meta-converse [18]. The PPV meta-converse gives an upper bound for the
number of messages that can be sent through a channel in terms of some hypothesis test; a connection
which also appeared in [11]. In Appendix B, we basically repeat the argument of [14] to show how to
interpret the LP (17) in terms of a hypothesis test.

Our main result shows that the LP relaxation (17) cannot be too far from the maximum success
probability in (3).


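Theorem 3.1. Let $W$ be a channel. Then, for any integers $\ell, k \in \{1, \dots, |X|\}$ we have:

$$S(W, \ell) \ge \frac{k}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) S^{\mathrm{NS}}(W, k) . \tag{18}$$

More precisely, this can be achieved via the greedy algorithm

$$S^{\mathrm{greedy}}(W, \ell) \ge \frac{k}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) S^{\mathrm{NS}}(W, k) , \tag{19}$$

or via random coding

$$\mathbb{E}_S\left\{ \frac{1}{\ell} f_W(S) \right\} \ge \frac{k}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) S^{\mathrm{NS}}(W, k) , \tag{20}$$

where $S \subseteq X$ is obtained by choosing $\ell$ elements of $X$ independently according to the distribution $\{\frac{p_x}{k}\}_{x \in X}$, where
$p_x$ is an optimal solution in (17) and $f_W$ is as defined in Proposition 2.1.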

Figure 1 gives an illustration for the statement of the theorem.

6 Note that we also have a small modification compared to the LP of [14] in that we require $p_x \le 1$. We can safely add this
constraint provided $k \le |X|$, which we assume throughout this section.

Figure 1: This plot illustrates the maximum success probability as a function of the number of messages
to be sent. The top curve corresponds to the maximum success probability if any non-signaling box
between the sender and the receiver is allowed. The bottom one corresponds to the setting where the
sender and the receiver do not share any additional resources. Theorem 3.1 states that for any value $\ell$ of
the number of messages, the ratio between the usual success probability and the non-signaling success
probability is at least $1 - e^{-1}$. It also gives a way of comparing the two values for different numbers of
messages to be sent.


Comments on the proof. As mentioned in the theorem, we give two proofs of this result. The first one
relates the performance of the greedy algorithm to the linear program (17). The $\ell = k$ case of this
theorem can be proved by using a result of [6] relating the performance of the greedy algorithm and
the linear programming relaxation for the location problem. In fact, the expression in (4) shows that the
optimal channel coding problem can be written as a location problem. To obtain the tradeoff between
success probability and number of messages, we use the observation about the greedy algorithm that
can be found in [13, Theorem 1.5]. A complete proof of this theorem appears in Section 3.1. The second
proof proceeds by standard randomized rounding techniques and can be found in Section 3.2.

Inequality (19) (for $\ell = k$) together with the fact that both $S^{\mathrm{greedy}}(W, k)$ and $S^{\mathrm{NS}}(W, k)$ are computable
in polynomial time might seem at first in contradiction with the NP-hardness in Corollary 2.2. But this
is not the case because it is unclear how to use the linear program $S^{\mathrm{NS}}(W, k)$ to obtain a lower bound on
$S(W, k)$ that is better than the greedy algorithm. Another consequence of (19) is that it proves that the
greedy algorithm gives a $(1 - e^{-1})$-approximation even for the maximum success probability $S^{\mathrm{Q}}(W, k)$
using entanglement.

Another relevant observation concerning the proof is that we can in fact obtain a multiplicative
bound between centered quantities. As $S(W, k) \ge \frac{1}{k}$ for any channel $W$ (the decoder can just randomly
guess a message), it might be useful to consider the ratio of $S(W, k) - \frac{1}{k}$ to $S^{\mathrm{NS}}(W, k) - \frac{1}{k}$, in particular for
small values of $k$. A simple modification in the proof leads to the bound:

$$S(W, k) - \frac{1}{k} \ge \left(1 - \left(1 - \frac{1}{k}\right)^{k-1}\right) \left( S^{\mathrm{NS}}(W, k) - \frac{1}{k} \right) . \tag{21}$$

This inequality generalizes the bounds obtained by Hemenway et al. [12], who considered the case
$k = 2$.


Comments on the ratio. The pre-factor in the right hand side of (18) can be simplified using the following
inequality

$$\frac{k}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) \ge \frac{k}{\ell}\left(1 - e^{-\ell/k}\right) , \tag{22}$$

and when $\ell/k \ll 1$, a good approximation is given by the following inequality

(23)

The bound shows, for example, that if it is possible to send $\log k$ bits of information with a success
probability $S^{\mathrm{NS}}(W, k) = 1 - \varepsilon$ using non-signaling boxes, then it is also possible to send $\log k - 10$ bits of
information with a success probability of at least $0.998 \cdot (1 - \varepsilon)$ without using any additional resources.
We show in Section 3.3 that the bound (18) is tight.
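To get a feeling for these pre-factors, the following Python sketch evaluates the factor $\frac{k}{\ell}(1 - (1 - \frac{1}{k})^{\ell})$ of Theorem 3.1 and the simpler lower bound $\frac{k}{\ell}(1 - e^{-\ell/k})$. The function names are hypothetical and chosen only for this illustration.

```python
import math

def exact_prefactor(k, l):
    """(k/l) * (1 - (1 - 1/k)**l), the factor appearing in Theorem 3.1."""
    return (k / l) * (1.0 - (1.0 - 1.0 / k) ** l)

def exp_prefactor(k, l):
    """(k/l) * (1 - exp(-l/k)), a simpler lower bound on the factor above."""
    return (k / l) * (1.0 - math.exp(-l / k))
```

For $\ell = k$ both quantities are at least $1 - e^{-1} \approx 0.632$. When the number of messages is reduced by 10 bits, for example `exp_prefactor(2**20, 2**10)`, the value exceeds 0.998, in line with the example in the text.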


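Multiple independent uses of a channel. It has been known for a long time that asymptotically for $n$
channel uses with $n \to \infty$, entanglement cannot increase the capacity $C(W)$ of a noisy classical channel
$W$ [?]. In fact, this is easily recovered from Theorem 3.1. Let $R$ be a rate achievable using a non-signaling
box, i.e., $S^{\mathrm{NS}}(W^{\otimes n}, R^n) \to 1$ as $n \to \infty$. We will show that $R \le C(W)$. In fact, let $\delta > 0$. Then
using Theorem 3.1 with $k = R^n$ and $\ell = (R(1 - \delta))^n$,

$$S(W^{\otimes n}, (R(1 - \delta))^n) \ge \frac{1}{(1 - \delta)^n}\left(1 - \left(1 - \frac{1}{R^n}\right)^{(R(1 - \delta))^n}\right) S^{\mathrm{NS}}(W^{\otimes n}, R^n) . \tag{24}$$

By (22), the pre-factor is at least $\frac{1}{(1 - \delta)^n}\left(1 - e^{-(1 - \delta)^n}\right) \ge 1 - \frac{(1 - \delta)^n}{2}$, which tends to 1 as $n \to \infty$,
and thus $S(W^{\otimes n}, R^n (1 - \delta)^n) \to 1$, which shows that $R(1 - \delta)$ is an achievable rate for the channel $W$
(without using any non-signaling boxes). By definition of the capacity $C(W)$, this means that $R(1 - \delta) \le C(W)$.
As this holds for any $\delta$, this shows that $R \le C(W)$.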


3.1 Proof of Theorem 3.1 via the greedy algorithm

In comparison with Corollary 2.2, we prove a stronger result relating the performance of the greedy
algorithm to the value of the linear programming relaxation. For that, it is useful to define the following
extension of the function $f_W$ defined in (4) to fractional vectors $p \in [0,1]^{|X|}$,

$$
\begin{aligned}
f_W(p) = \underset{r_{x,y}}{\text{maximize}} \quad & \sum_{x, y} W(y|x) r_{x,y} \\
\text{subject to} \quad & \sum_{x} r_{x,y} \le 1 && \forall y \in Y \\
& 0 \le r_{x,y} \le p_x && \forall (x, y) \in X \times Y .
\end{aligned}
\tag{25}
$$

With this notation, we can write

$$S^{\mathrm{NS}}(W, k) = \frac{1}{k} \max_{p_x \ge 0,\ \sum_x p_x = k} f_W(p) . \tag{26}$$
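For a fixed vector $p$, the program (25) decouples across outputs $y$, and each piece is a fractional knapsack with unit weights: the budget $\sum_x r_{x,y} \le 1$ is best spent on the inputs with the largest $W(y|x)$, up to $p_x$ each. The Python sketch below uses this observation, which is an elementary remark and not a step taken from the paper. The name `f_W_fractional` and the nested-list channel representation (`W[x][y]` for $W(y|x)$) are assumptions of the sketch.

```python
def f_W_fractional(W, p):
    """Value of program (25) for a fixed vector p, solved one output y at a time."""
    total = 0.0
    for y in range(len(W[0])):
        budget = 1.0  # constraint: sum over x of r[x][y] is at most 1
        # Spend the budget on inputs in decreasing order of W(y|x), at most p[x] each.
        for x in sorted(range(len(W)), key=lambda x: W[x][y], reverse=True):
            r = min(p[x], budget)
            total += W[x][y] * r
            budget -= r
            if budget <= 0.0:
                break
    return total
```

When `p` is the 0/1 indicator vector of a set $S$, the function returns $f_W(S)$. For any `p` with non-negative entries summing to $k$, dividing the result by $k$ gives a lower bound on $S^{\mathrm{NS}}(W, k)$ by (26).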

The following lemma is crucial in proving Theorem 3.1.

Lemma 3.2. Let $S \subseteq X$ and let $p \in [0,1]^{|X|}$ be a vector such that $\sum_x p_x = k$. Then

$$f_W(p) \le f_W(S) + k \left( \max_{x \in X} f_W(S \cup \{x\}) - f_W(S) \right) .$$

Proof. Define for any $x \in X$, $q_x = \max\{p_x, \mathbb{1}_{x \in S}\}$. Note that $q_x \ge p_x$ for all $x$, and so $f_W(p) \le f_W(q)$.
We aim to show the stronger statement $f_W(q) \le f_W(S) + k (\max_{x \in X} f_W(S \cup \{x\}) - f_W(S))$. Let $r_{x,y}$
correspond to an optimal solution for the program of $f_W(q)$; in particular $f_W(q) = \sum_{x,y} W(y|x) r_{x,y}$. We
have

$$
\begin{aligned}
f_W(q) - f_W(S) &= \sum_{x, y} W(y|x) r_{x,y} - \sum_{y} \max_{x' \in S} W(y|x') && (27) \\
&= \sum_{y} \left( \sum_{x} W(y|x) r_{x,y} - \max_{x' \in S} W(y|x') \right) && (28) \\
&\le \sum_{y} \left( \sum_{x \in S} W(y|x) r_{x,y} - \Big( \sum_{x \in S} r_{x,y} \Big) \max_{x' \in S} W(y|x') + \sum_{x \notin S} W(y|x) r_{x,y} - \Big( \sum_{x \notin S} r_{x,y} \Big) \max_{x' \in S} W(y|x') \right) && (29)
\end{aligned}
$$

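using the fact that $\sum_x r_{x,y} \le 1$. Now observe that $\sum_{x \in S} W(y|x) r_{x,y} - (\sum_{x \in S} r_{x,y}) \max_{x' \in S} W(y|x') \le 0$
and thus

$$
\begin{aligned}
f_W(q) - f_W(S) &\le \sum_{y} \left( \sum_{x \notin S} W(y|x) r_{x,y} - \Big( \sum_{x \notin S} r_{x,y} \Big) \max_{x' \in S} W(y|x') \right) && (30) \\
&= \sum_{x \notin S} \sum_{y} r_{x,y} \left( W(y|x) - \max_{x' \in S} W(y|x') \right) && (31) \\
&\le \sum_{x \notin S} \sum_{y \in \Gamma(x)} r_{x,y} \left( W(y|x) - \max_{x' \in S} W(y|x') \right) , && (32)
\end{aligned}
$$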

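where $\Gamma(x) = \{y : W(y|x) - \max_{x' \in S} W(y|x') > 0\}$. For $x \notin S$, we have $q_x = p_x$ and thus $r_{x,y} \le p_x$. As a
result,

$$
\begin{aligned}
f_W(q) - f_W(S) &\le \sum_{x \notin S} p_x \sum_{y \in \Gamma(x)} \left( W(y|x) - \max_{x' \in S} W(y|x') \right) && (33) \\
&\le k \cdot \sum_{y \in \Gamma(x^*)} \left( W(y|x^*) - \max_{x' \in S} W(y|x') \right) , && (34)
\end{aligned}
$$

where $x^* \notin S$ maximizes the quantity $\sum_{y \in \Gamma(x)} (W(y|x) - \max_{x' \in S} W(y|x'))$. Now observe that by definition
of $\Gamma(x^*)$, we have

$$f_W(S \cup \{x^*\}) - f_W(S) = \sum_{y \in \Gamma(x^*)} \left( W(y|x^*) - \max_{x' \in S} W(y|x') \right) . \tag{35}$$

Combining this with (34), we get the desired result. □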

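Proof [of Theorem 3.1, Eq. (19)]. Using Lemma 3.2, we can apply the framework of [15] for analyzing the
performance of the greedy algorithm. Recall the notation introduced for the greedy algorithm: starting
from $S_0 = \emptyset$, $S_{\ell+1}$ is constructed from $S_\ell$ by adding an element $x_{\ell+1}$ that maximizes $f_W(S_\ell \cup \{x_{\ell+1}\})$,
so that $S_{\ell+1} = S_\ell \cup \{x_{\ell+1}\}$. Fix an integer $i_0 \in \{0, \dots, |X|\}$ (think of $i_0 = 0$, but it will also be useful to
choose $i_0 = 1$). We prove by induction on $\ell$ that for any $\ell \ge i_0$

$$\max_{p} f_W(p) - f_W(S_\ell) \le \left(1 - \frac{1}{k}\right)^{\ell - i_0} \left( \max_{p} f_W(p) - f_W(S_{i_0}) \right) , \tag{36}$$

where the maximizations are taken over all $p$ such that $p_x \ge 0$ and $\sum_x p_x = k$. The base case $\ell = i_0$ is
clear. Using Lemma 3.2 together with the fact that $f_W(S_{\ell+1}) = \max_{x^* \in X} f_W(S_\ell \cup \{x^*\})$ gives

$$\max_{p} f_W(p) - f_W(S_\ell) \le k \cdot \left( f_W(S_{\ell+1}) - f_W(S_\ell) \right) . \tag{37}$$

Rearranging the terms we see that

$$
\begin{aligned}
\max_{p} f_W(p) - f_W(S_{\ell+1}) &\le \left(1 - \frac{1}{k}\right) \left( \max_{p} f_W(p) - f_W(S_\ell) \right) && (38) \\
&\le \left(1 - \frac{1}{k}\right)^{\ell + 1 - i_0} \left( \max_{p} f_W(p) - f_W(S_{i_0}) \right) && (39)
\end{aligned}
$$

using the induction hypothesis. For $i_0 = 0$, this can be written as

$$f_W(S_\ell) \ge \left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) \max_{p} f_W(p) . \tag{40}$$

Dividing by $\ell$ and using (26), we have

$$
\begin{aligned}
S(W, \ell) \ge S^{\mathrm{greedy}}(W, \ell) = \frac{1}{\ell} f_W(S_\ell) &\ge \frac{1}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) \max_{p} f_W(p) && (41) \\
&= \frac{k}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) S^{\mathrm{NS}}(W, k) . && (42)
\end{aligned}
$$

This concludes the proof of the theorem. To prove (21), take $i_0 = 1$, observe that $f_W(S_1) = 1$ and set
$\ell = k$; (36) becomes

$$\frac{\max_{p} f_W(p) - f_W(S_k)}{\max_{p} f_W(p) - 1} \le \left(1 - \frac{1}{k}\right)^{k-1} .$$

Rearranging the terms, we get the inequality

$$\frac{1}{k} f_W(S_k) - \frac{1}{k} \ge \left(1 - \left(1 - \frac{1}{k}\right)^{k-1}\right) \left( S^{\mathrm{NS}}(W, k) - \frac{1}{k} \right) . \tag{43}$$

□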


3.2 Proof of Theorem 3.1 via random coding

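Proof [of Theorem 3.1, Eq. (20)]. Recall that $S$ is obtained by choosing $\ell$ independent samples from the
distribution $\{\frac{p_x}{k}\}_{x \in X}$. We have

$$\mathbb{E}_S\{ f_W(S) \} = \sum_{y} \mathbb{E}_S\left\{ \max_{x \in S} W(y|x) \right\} .$$

We will show that for any $y \in Y$, $\mathbb{E}_S\{\max_{x \in S} W(y|x)\} \ge \left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) \sum_x W(y|x) r_{x,y}$.

In order to show this, we order the inputs $\{x_1, \dots, x_{|X|}\} = X$ by decreasing order of $W(y|x)$, so that
$W(y|x_1) \ge W(y|x_2) \ge \dots \ge W(y|x_{|X|})$. We may then write

$$\mathbb{E}_S\left\{ \max_{x \in S} W(y|x) \right\} = \sum_{i=1}^{|X|} \mathbb{P}_S\{ x_1 \notin S \cap \dots \cap x_{i-1} \notin S \cap x_i \in S \}\, W(y|x_i) .$$

Note that we used the convention that $x_0 \notin S$ always holds. Observe that

$$\mathbb{P}_S\{ x_1 \notin S \cap \dots \cap x_{i-1} \notin S \cap x_i \in S \} = \mathbb{P}_S\{ x_1 \in S \cup \dots \cup x_i \in S \} - \mathbb{P}_S\{ x_1 \in S \cup \dots \cup x_{i-1} \in S \} .$$

Thus, we obtain

$$\mathbb{E}_S\left\{ \max_{x \in S} W(y|x) \right\} = \sum_{i=1}^{|X|-1} \mathbb{P}_S\{ x_1 \in S \cup \dots \cup x_i \in S \} \left( W(y|x_i) - W(y|x_{i+1}) \right) + W(y|x_{|X|}) . \tag{44}$$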
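Now we can find a lower bound on $\mathbb{P}_S\{ x_1 \in S \cup \dots \cup x_i \in S \}$ using the variables $p_x$ and $r_{x,y}$ of the linear
program. In fact,

$$
\begin{aligned}
\mathbb{P}_S\{ x_1 \in S \cup \dots \cup x_i \in S \} &= 1 - \left(1 - \frac{p_{x_1} + \dots + p_{x_i}}{k}\right)^{\ell} \\
&\ge 1 - \left(1 - \frac{r_{x_1,y} + \dots + r_{x_i,y}}{k}\right)^{\ell} \\
&\ge \left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) \left( r_{x_1,y} + \dots + r_{x_i,y} \right) ,
\end{aligned}
$$

where the first inequality comes from the constraint $r_{x,y} \le p_x$ and the last inequality from the constraint
$\sum_x r_{x,y} \le 1$ and the concavity of the function $z \mapsto 1 - (1 - \frac{z}{k})^{\ell}$. Going back to (44), we obtain

$$
\begin{aligned}
\mathbb{E}_S\left\{ \max_{x \in S} W(y|x) \right\} &\ge \left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) \sum_{i=1}^{|X|-1} \left( r_{x_1,y} + \dots + r_{x_i,y} \right) \left( W(y|x_i) - W(y|x_{i+1}) \right) + W(y|x_{|X|}) && (45) \\
&\ge \left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) \sum_{i=1}^{|X|} r_{x_i,y} W(y|x_i) , && (46)
\end{aligned}
$$

where we used again in the second inequality the constraint that $\sum_x r_{x,y} \le 1$. This proves the claimed
inequality and thus by summing over $y \in Y$, we obtain

$$\mathbb{E}_S\left\{ \frac{1}{\ell} f_W(S) \right\} \ge \frac{k}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) S^{\mathrm{NS}}(W, k) .$$

□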
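The random coding guarantee can be checked exactly on very small instances by enumerating all possible draws. The sketch below is only an illustration: the function name, the nested-list channel representation (`W[x][y]` for $W(y|x)$), and the brute-force enumeration, whose cost grows as $|X|^{\ell}$, are assumptions made here and not part of the paper.

```python
from itertools import product

def expected_random_code_value(W, p, k, l):
    """Exact E[f_W(S) / l] when S collects l independent draws from the distribution p[x] / k."""
    n_in, n_out = len(W), len(W[0])
    expectation = 0.0
    for draws in product(range(n_in), repeat=l):
        prob = 1.0
        for x in draws:
            prob *= p[x] / k
        if prob == 0.0:
            continue
        # f_W of the set of drawn symbols (repeated draws do not change the maximum).
        value = sum(max(W[x][y] for x in draws) for y in range(n_out))
        expectation += prob * value
    return expectation / l
```

As a usage example, take four inputs, the six two-element subsets of the inputs as outputs, $W(y|x) = 1/3$ when $x \in y$, and the uniform vector $p_x = 1/2$ with $k = \ell = 2$. The expectation then equals $0.75 = \frac{k}{\ell}(1 - (1 - \frac{1}{k})^{\ell})$, so the bound holds with equality for this choice of $p$, provided $S^{\mathrm{NS}}(W, 2) = 1$ for this channel.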


3.3 Tightness of Theorem 3.1

We now prove that the result shown in Theorem 3.1 is tight using a simple family of graphs proposed
in [1]. Consider the following channel for $k, t \ge 1$ integers. The input alphabet $X$ is composed of $n = kt$
symbols and the output alphabet $Y$ is composed of $\binom{n}{t}$ symbols that we interpret as subsets of $X$ of size
$t$. On input $x$, the output of the channel is a randomly chosen $y$ such that $x \in y$:

$$W(y|x) = \begin{cases} \frac{1}{\binom{n-1}{t-1}} & \text{if } x \in y \\ 0 & \text{otherwise.} \end{cases} \tag{47}$$

Note that, interestingly, the case $k = t = 2$ is exactly the channel that is studied in [21], in which it is
experimentally demonstrated that entanglement assistance can help in improving the success probability
for sending a bit over this channel.

We first show that $S^{\mathrm{NS}}(W, k) = 1$. For this let $p_x = \frac{k}{n}$ for all $x \in X$ and $r_{x,y} = \frac{k}{n}$ if $x \in y$ and $r_{x,y} = 0$
otherwise. We have $\sum_x r_{x,y} = t \cdot \frac{k}{n} = 1$. Moreover,

$$\frac{1}{k} \sum_{x, y} W(y|x) r_{x,y} = \frac{1}{k} \sum_{y} \sum_{x \in y} \frac{1}{\binom{n-1}{t-1}} \cdot \frac{k}{n} = \frac{1}{k} \binom{n}{t}\, t\, \frac{1}{\binom{n-1}{t-1}} \cdot \frac{k}{n} = 1 . \tag{48}$$

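Using the symmetry of the channel it is simple to determine $S(W, \ell)$ for any $1 \le \ell \le n$. In fact, one can
see that $f_W(S)$ only depends on the size of $S$; let $S_\ell$ be any fixed set of size $\ell$:

$$S(W, \ell) = \frac{1}{\ell} \max_{|S| = \ell} f_W(S) = \frac{1}{\ell} \sum_{y \in Y : y \cap S_\ell \neq \emptyset} \frac{1}{\binom{n-1}{t-1}} . \tag{49}$$

So we only need to count the number of subsets $y$ that intersect with the set $S_\ell$ of size $\ell$. This number is
given by $\binom{n}{t} - \binom{n-\ell}{t}$. Observing that $\binom{n-1}{t-1} = \frac{t}{n} \binom{n}{t}$ and $t/n = 1/k$, we have

$$
\begin{aligned}
S(W, \ell) &= \frac{k}{\ell} \cdot \frac{\binom{n}{t} - \binom{n-\ell}{t}}{\binom{n}{t}} && (50) \\
&= \frac{k}{\ell} \left( 1 - \frac{\binom{n-\ell}{t}}{\binom{n}{t}} \right) && (51) \\
&= \frac{k}{\ell} \left( 1 - \frac{(n - t) \cdots (n - \ell - t + 1)}{n (n - 1) \cdots (n - \ell + 1)} \right) . && (52)
\end{aligned}
$$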

From this expression, we see that, for example, by fixing $\ell$ and $k$ to be constants and letting $t \to \infty$, this
expression approaches

$$\frac{k}{\ell}\left(1 - \left(1 - \frac{1}{k}\right)^{\ell}\right) , \tag{53}$$

which exactly matches the bound in Theorem 3.1.
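The convergence to the bound of Theorem 3.1 can be observed numerically. The following Python sketch builds the channel on $n = kt$ inputs whose outputs are the $t$-subsets of the inputs, and evaluates the closed form $\frac{k}{\ell}(1 - \binom{n-\ell}{t} / \binom{n}{t})$. The helper names `subset_channel` and `classical_value` are hypothetical, and `W[x][y]` stands for $W(y|x)$.

```python
from itertools import combinations
from math import comb

def subset_channel(k, t):
    """Inputs 0..n-1 with n = k*t; outputs are the t-subsets; W[x][y] = 1/C(n-1, t-1) if x is in y."""
    n = k * t
    outputs = list(combinations(range(n), t))
    w = 1.0 / comb(n - 1, t - 1)
    return [[(w if x in y else 0.0) for y in outputs] for x in range(n)]

def classical_value(k, t, l):
    """Closed form (k/l) * (1 - C(n-l, t) / C(n, t)) for S(W, l), valid for 1 <= l <= n."""
    n = k * t
    return (k / l) * (1.0 - comb(n - l, t) / comb(n, t))
```

For $k = \ell = 2$, `classical_value(2, 2, 2)` equals $5/6$, and `classical_value(2, 200, 2)` is already within $0.001$ of the limit $0.75$.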


4 Discussion 

The main message of this work is to draw a connection between the study of optimal coding for noisy 
channels in information theory and algorithmic aspects of submodular maximization. We believe this 
connection could be fruitful in both directions. As we showed in this paper, techniques developed in the 
context of submodular maximization—and, in general, approximation algorithms—can have interesting 
implications when analyzing the problem of optimal channel coding. We believe there are many more 
relevant applications to be explored. A particular question is whether algorithmic techniques can be 
helpful in obtaining better finite-blocklength bounds for well-studied channels such as the ones given
in [16].

A Non-signaling assisted channel coding

Suppose that the sender and the receiver have a box shared between them with the following properties.
Alice inputs $\alpha$ and receives output $a$ and Bob inputs $\beta$ and receives output $b$. We say that such a box
is non-signaling if by itself it is useless for communication. More formally, such a box is described by a
conditional probability distribution $P(a, b|\alpha, \beta)$ representing the probability that the outputs are $a$ and
$b$ given that the inputs are $\alpha$ and $\beta$. The non-signaling property is then easily formulated as a linear
constraint on these numbers: the marginal distribution on $a$ is independent of the input $\beta$ of Bob and
the marginal distribution on $b$ is independent of the input $\alpha$ of Alice


P(a\a, 8) = f ^2 b\ a -> @) = Pa{cl\o) 

b 

(54) 

P(b\a , 8) = E 6 I«’ & = P ^ b \P ’ 

(55) 


a 
for some conditional distributions $P_A$ and $P_B$. In other words, $P(a|\alpha,\beta)$ is the same for all values of
$\beta$. Perhaps the simplest example of a non-signaling box is one that provides shared randomness. In
this case $P(a,a'|\alpha,\beta) = q_a \delta_{a,a'}$. Also, if Alice and Bob share a quantum state (which could be entangled)
and perform measurements that depend on their inputs, the outputs being the measurement outcomes,
this also defines a non-signaling box. There are also some distributions that are non-signaling but do
not seem to be physically realizable without communication, the most well-known being the
Popescu-Rohrlich box [20].
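
To make the non-signaling constraints (54) and (55) concrete, the sketch below checks them exactly for the Popescu-Rohrlich box, using its standard description: binary inputs and outputs, with the outputs uniformly distributed subject to $a \oplus b = \alpha \wedge \beta$. The helper names, and the second "copying" box included as a negative example, are hypothetical and introduced only for illustration.

```python
from fractions import Fraction
from itertools import product


def pr_box(a, b, alpha, beta):
    """Popescu-Rohrlich box: uniform over outputs with a XOR b == alpha AND beta."""
    return Fraction(1, 2) if (a ^ b) == (alpha & beta) else Fraction(0)


def copying_box(a, b, alpha, beta):
    """Bob's output copies Alice's input, so this box does signal."""
    return Fraction(1) if (a == 0 and b == alpha) else Fraction(0)


def is_non_signaling(box, outputs_a, outputs_b, inputs_a, inputs_b):
    """Check that each party's marginal ignores the other party's input."""
    for a, alpha in product(outputs_a, inputs_a):
        # Alice's marginal, computed once for every input beta of Bob.
        marginals = {sum(box(a, b, alpha, beta) for b in outputs_b)
                     for beta in inputs_b}
        if len(marginals) != 1:
            return False
    for b, beta in product(outputs_b, inputs_b):
        # Bob's marginal, computed once for every input alpha of Alice.
        marginals = {sum(box(a, b, alpha, beta) for a in outputs_a)
                     for alpha in inputs_a}
        if len(marginals) != 1:
            return False
    return True


bits = (0, 1)
print(is_non_signaling(pr_box, bits, bits, bits, bits))
print(is_non_signaling(copying_box, bits, bits, bits, bits))
```

Exact rational arithmetic is used so that the equality tests on marginals are not affected by floating-point rounding.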

Now assume that the sender and the receiver share such a box. The sender may give an input 
depending on the message he wishes to send and use the output he receives from the box to choose the 
symbol $x \in X$. Similarly, the receiver can give an input $\beta$ that depends on the symbol $y$ he receives
and might use the output to decode the message. By encompassing all pre- and post-processing of the 
sender and the receiver into the box itself, we can assume that Alice inputs the message i into the box 
and the output of the box is exactly the input to the channel. Similarly, Bob inputs y into the box and 
receives j, a candidate for the sent message. 

Given this definition, we can naturally define the non-signaling success probability as 


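\[
\begin{aligned}
S^{\mathrm{NS}}(W,k) := \underset{P(x,j|i,y),\, P_A,\, P_B}{\text{maximize}} \quad
  & \frac{1}{k} \sum_{x,y,i} W(y|x) P(x,i|i,y) \\
\text{subject to} \quad
  & \sum_{j} P(x,j|i,y) = P_A(x|i) && \forall (i,x,y) \in [k] \times X \times Y \\
  & \sum_{x} P(x,j|i,y) = P_B(j|y) && \forall (j,i,y) \in [k] \times [k] \times Y \\
  & \sum_{x,j} P(x,j|i,y) = 1 && \forall (i,y) \in [k] \times Y \\
  & 0 \le P(x,j|i,y) && \forall (j,i,x,y) \in [k] \times [k] \times X \times Y .
\end{aligned} \tag{56}
\]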


We now prove that this linear program and the one in (17) have the same value.

Given a box $P$, let us construct a feasible solution for (17). Let $r_{x,y} = \sum_i P(x,i|i,y)$ and
$p_x = \sum_i P_A(x|i)$. Then clearly $\sum_x r_{x,y} = \sum_i P_B(i|y) = 1$ and $\sum_x p_x = \sum_{x,i} P_A(x|i) = k$. We still need to
show that we can assume $p_x \le 1$. For this, define $p'_x = \min\{p_x, 1\}$. As $r_{x,y} \le 1$, we still have $r_{x,y} \le p'_x$.
But the sum $\sum_x p'_x$ might be less than $k$. However, as $k \le |X|$, there exists $\tilde{p}_x \ge p'_x$ with $\tilde{p}_x \le 1$
satisfying $\sum_x \tilde{p}_x = k$. The pair $(r_{x,y}, \tilde{p}_x)$ is then a feasible solution for (17) with objective function equal
to $\frac{1}{k} \sum_{x,y,i} W(y|x) P(x,i|i,y)$.

For the other direction, define

\[
P(x,j|i,y) =
\begin{cases}
\dfrac{r_{x,y}}{k} & \text{if } i = j \\[2ex]
\dfrac{p_x - r_{x,y}}{k(k-1)} & \text{if } i \neq j .
\end{cases} \tag{57}
\]

It is simple to see that this distribution defines a non-signaling box.
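
The claim can be checked mechanically on a small instance. The sketch below builds the box with entries $r_{x,y}/k$ for $i = j$ and $(p_x - r_{x,y})/(k(k-1))$ for $i \neq j$, then verifies nonnegativity and the two marginal constraints of (56). It assumes $k \ge 2$ and $\sum_x r_{x,y} = 1$ for every $y$, which is what makes Bob's marginal come out as $1/k$ regardless of $i$; the function names and the numerical instance are hypothetical.

```python
from fractions import Fraction as F
from itertools import product


def box_from_lp_solution(p, r, k):
    """Build P[(x, j, i, y)] from p[x] and r[x][y]; assumes k >= 2."""
    X, Y, K = range(len(p)), range(len(r[0])), range(k)
    P = {}
    for x, j, i, y in product(X, K, K, Y):
        if i == j:
            P[x, j, i, y] = r[x][y] / k
        else:
            P[x, j, i, y] = (p[x] - r[x][y]) / (k * (k - 1))
    return P


def is_valid_non_signaling_box(P, p, r, k):
    """Check nonnegativity and both marginal constraints of the box."""
    X, Y, K = range(len(p)), range(len(r[0])), range(k)
    nonneg = all(v >= 0 for v in P.values())
    # Alice's marginal sum_j P(x, j | i, y) must equal p_x / k, whatever y is.
    alice = all(sum(P[x, j, i, y] for j in K) == p[x] / k
                for x, i, y in product(X, K, Y))
    # Bob's marginal sum_x P(x, j | i, y) must equal 1 / k, whatever i is.
    bob = all(sum(P[x, j, i, y] for x in X) == F(1, k)
              for j, i, y in product(K, K, Y))
    return nonneg and alice and bob


# Illustrative instance: k = 2 messages, 3 channel inputs, 2 channel outputs.
k = 2
p = [F(1), F(1, 2), F(1, 2)]                         # sums to k
r = [[F(1), F(1, 2)], [F(0), F(1, 2)], [F(0), F(0)]]  # each column sums to 1
P = box_from_lp_solution(p, r, k)
print(is_valid_non_signaling_box(P, p, r, k))
```

If some column of `r` sums to less than 1, Bob's marginal differs between $i = j$ and $i \neq j$ and the check fails, which is why the column sums are normalized in this example.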


B Interpreting the LP in terms of hypothesis testing 


Consider two distributions $P$ and $Q$ over the set $Z$. Given a sample from either $P$ or $Q$, we wish to
determine which distribution generated the sample. One can define a randomized test $T : Z \to [0,1]$
where $T(z)$ is the probability of declaring the distribution to be $P$. An important quantity studied
in statistical hypothesis testing is the smallest probability of falsely outputting $P$ among all tests that
correctly identify $P$ with probability at least $\alpha$. More precisely,

\[
\beta_\alpha(P,Q) := \min_{\substack{T : Z \to [0,1] \\ \sum_z P(z) T(z) \ge \alpha}} \; \sum_z Q(z) T(z) . \tag{58}
\]
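
Since (58) is a linear program with a single covering constraint, it can be solved greedily in the spirit of the Neyman-Pearson lemma: accept points in increasing order of the ratio $Q(z)/P(z)$ and randomize only on the last point needed to reach $\alpha$. The sketch below is one minimal way to do this, assuming $0 \le \alpha \le 1$ and distributions given as lists; the function name `beta` and the example distributions are illustrative.

```python
from fractions import Fraction as F


def beta(alpha, P, Q):
    """Greedy solution of min sum_z Q(z) T(z) s.t. sum_z P(z) T(z) >= alpha."""
    # Points with P(z) == 0 cannot help satisfy the constraint, so T(z) = 0 there.
    support = [z for z in range(len(P)) if P[z] > 0]
    # Cheapest points first: smallest cost Q(z) per unit of P-mass covered.
    support.sort(key=lambda z: Q[z] / P[z])
    remaining, value = alpha, 0
    for z in support:
        if remaining <= 0:
            break
        t = min(1, remaining / P[z])  # T(z); fractional only on the last point
        value += t * Q[z]
        remaining -= t * P[z]
    return value


P = [F(1, 2), F(1, 4), F(1, 4)]
Q = [F(1, 4), F(1, 4), F(1, 2)]
print(beta(F(3, 4), P, Q))
print(beta(F(5, 8), P, Q))
```

When $P = Q$ every ratio equals 1 and the routine returns $\alpha$, as expected for indistinguishable hypotheses.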


We will see that the quantity $S^{\mathrm{NS}}(W,k)$ is related to $\beta_\alpha(P,Q)$ for distributions $P$ and $Q$ constructed
from the channel $W$. More specifically, let $\mu \in D(X)$ be a distribution on inputs $X$ and $\nu \in D(Y)$
be a distribution on outputs $Y$. Then let the distribution $\mu \cdot W$ on $X \times Y$ be defined by probabilities
$\mu(x) W(y|x)$ and $\mu \cdot \nu$ be the distribution defined by probabilities $\mu(x)\nu(y)$. Then the maximum success
probability $S^{\mathrm{NS}}(W,k)$ can be interpreted as a distance between the product distribution $\mu \cdot \nu$ and the
distribution induced by the channel $\mu \cdot W$ in the following sense:


Proposition B.1.

\[
1 - S^{\mathrm{NS}}(W,k) = \min_{\mu \in D(X)} \max_{\nu \in D(Y)} \beta_{1-1/k}(\mu \cdot \nu, \mu \cdot W) . \tag{59}
\]

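Proof. Given a feasible solution $p_x$ and $r_{x,y}$ for (17), we define a distribution $\mu(x) = p_x/k$ and a test
$T(x,y) = 1 - \frac{r_{x,y}}{p_x}$. The constraints in (17) readily give $\sum_x \mu(x) = 1$ and $T(x,y) \in [0,1]$ for all $x, y$. We
then have that for any distribution $\nu$ on $Y$,

\begin{align}
\sum_{x,y} \mu(x) \nu(y) T(x,y) &\ge \min_y \sum_x \frac{p_x}{k} \cdot \left( 1 - \frac{r_{x,y}}{p_x} \right) \tag{60} \\
&= \min_y \left( 1 - \frac{1}{k} \sum_x r_{x,y} \right) \tag{61} \\
&\ge 1 - \frac{1}{k} . \tag{62}
\end{align}

In addition, we have

\begin{align}
\sum_{x,y} \mu(x) W(y|x) T(x,y) &= 1 - \frac{1}{k} \sum_{x,y} r_{x,y} W(y|x) \tag{63} \\
&\ge 1 - S^{\mathrm{NS}}(W,k) . \tag{64}
\end{align}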

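As a result, taking the feasible solution to be optimal, so that (64) holds with equality, there exists $\mu$ such that for all $\nu$, $\beta_{1-1/k}(\mu \cdot \nu, \mu \cdot W) \le 1 - S^{\mathrm{NS}}(W,k)$.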

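For the other direction, we first show using linear programming duality that for any $\mu$,

\[
\max_{\nu \in D(Y)} \beta_{1-1/k}(\mu \cdot \nu, \mu \cdot W)
= \min_{\substack{T : X \times Y \to [0,1] \\ \forall y,\ \sum_x \mu(x) T(x,y) \ge 1 - \frac{1}{k}}} \; \sum_{x,y} \mu(x) W(y|x) T(x,y) . \tag{65}
\]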


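In order to see this, observe that $\beta_\alpha(\mu \cdot \nu, \mu \cdot W)$ is a linear program and thus using duality can also be
written as a maximization program. In fact,

\[
\beta_\alpha(\mu \cdot \nu, \mu \cdot W)
= \max_{\substack{\lambda_1 \ge 0,\ \lambda_2(x,y) \ge 0 \\ \mu(x) W(y|x) + \lambda_2(x,y) \ge \lambda_1 \mu(x) \nu(y)}} \; \lambda_1 \alpha - \sum_{x,y} \lambda_2(x,y) . \tag{66}
\]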


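As a result,

\begin{align}
\max_{\nu \in D(Y)} \beta_\alpha(\mu \cdot \nu, \mu \cdot W)
&= \max_{\substack{\lambda_1 \ge 0,\ \lambda_2(x,y) \ge 0 \\ \mu(x) W(y|x) + \lambda_2(x,y) \ge \lambda_1 \mu(x) \nu(y) \\ \nu(y) \ge 0,\ \sum_y \nu(y) = 1}} \; \lambda_1 \alpha - \sum_{x,y} \lambda_2(x,y) \tag{67} \\
&= \max_{\substack{\lambda_1(y) \ge 0,\ \lambda_2(x,y) \ge 0 \\ \mu(x) W(y|x) + \lambda_2(x,y) \ge \lambda_1(y) \mu(x)}} \; \alpha \sum_y \lambda_1(y) - \sum_{x,y} \lambda_2(x,y) , \tag{68}
\end{align}
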


where we simply set $\lambda_1(y) = \lambda_1 \nu(y)$. To conclude the proof of (65), it suffices to observe that this last
expression is nothing but the dual program for the right hand side of (65).

Now given a distribution $\mu$ on $X$ and a test $T$ that satisfies the constraint on the right hand side of (65),
we define $p_x = k \cdot \mu(x)$ and $r_{x,y} = k \cdot \mu(x)(1 - T(x,y))$. Then the condition $\sum_x \mu(x) T(x,y) \ge 1 - \frac{1}{k}$
translates to $\sum_x \mu(x) - \frac{1}{k} \sum_x r_{x,y} \ge 1 - \frac{1}{k}$. In other words, $\sum_x r_{x,y} \le 1$. In addition, we clearly have
$\sum_x p_x = k$ and $r_{x,y} \le p_x$. This concludes the proof of the claim. □
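
The two changes of variables used in the proof, $(p, r) \mapsto (\mu, T)$ with $\mu(x) = p_x/k$, $T(x,y) = 1 - r_{x,y}/p_x$, and $(\mu, T) \mapsto (p, r)$ with $p_x = k\mu(x)$, $r_{x,y} = k\mu(x)(1 - T(x,y))$, can be checked on a toy instance. The sketch below assumes $p_x > 0$ for every $x$ so that the test is well defined; the channel `W`, the values of `p` and `r`, and the function names are hypothetical choices made for illustration.

```python
from fractions import Fraction as F


def to_test(p, r, k):
    """Map a pair (p, r) to (mu, T); assumes every p[x] > 0."""
    mu = [px / k for px in p]
    T = [[1 - r[x][y] / p[x] for y in range(len(r[x]))] for x in range(len(p))]
    return mu, T


def to_lp(mu, T, k):
    """Map a distribution mu and a test T back to a pair (p, r)."""
    p = [k * m for m in mu]
    r = [[k * mu[x] * (1 - T[x][y]) for y in range(len(T[x]))]
         for x in range(len(mu))]
    return p, r


k = 2
W = [[F(3, 4), F(1, 4)], [F(1, 4), F(3, 4)], [F(1, 2), F(1, 2)]]  # W[x][y]
p = [F(1), F(1, 2), F(1, 2)]
r = [[F(1), F(1, 2)], [F(0), F(1, 2)], [F(0), F(0)]]

mu, T = to_test(p, r, k)
xs, ys = range(len(p)), range(len(W[0]))

# Left and right sides of the identity (63).
lhs = sum(mu[x] * W[x][y] * T[x][y] for x in xs for y in ys)
rhs = 1 - F(1, k) * sum(r[x][y] * W[x][y] for x in xs for y in ys)
# Per-output constraint sum_x mu(x) T(x, y) >= 1 - 1/k from (65).
constraint_ok = all(sum(mu[x] * T[x][y] for x in xs) >= 1 - F(1, k) for y in ys)

print(lhs, rhs, constraint_ok)
print(to_lp(mu, T, k) == (p, r))
```

On this instance both sides of (63) agree and the round trip returns the original pair, which is consistent with the two maps being inverse to each other whenever $p_x > 0$.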